The experiment

Learning objectives:

  1. Background on the copepod selection experimental design
  2. Think about the hypotheses we can address with this experiment
  3. Cover how bisulfite sequencing data differs from regular data and not to panic when you see your fastqc output
  4. Begin mapping using Bismark

Copepod selection experiment

Acartia tonsa is a calanoid copepod that has a world wide distribution. It inhabits estuaries and coastal waters and it typically the most abundant zooplankton present. Because of their huge population sizes and global distribution, they play an important role in ocean biogeochemical cycling and ecosystem functioning. For example, they’re an important food base for many fishes. Given their broad distribution in dynamic habitats (estuaries), they can tolerate and live in a broad range of salinities, freshwater to seawater, and temperatures, among other stressors.

We know that the world is rapidly changing due to human activities and it is important to understand how species will respond to these changes. We were interested in understanding how A. tonsa could respond to stressful environments that will likely be caused by climate change. Can they adapt to rapid changes in temperature and acidity? How might epigenetic responses interact with adaptation to enable resilience?

A. tonsa is a great system to ask these questions for a number of reasons. First, their generation time is only ~10-20 days and they’re tiny! Only 1-2 mm. This means we can easily execute multi-generational studies with thousands of individuals. Second, because they can tolerate such variable environments, they should have lots of plasticity to respond to environmental fluctuations. Finally, their large population sizes mean that genetic diversity should be high, giving them high adaptive potential. This means that if any species can respond to rapid environmental change, it is likely A. tonsa.

Experimental design

A. tonsa were collected from the wild, common gardened for three generations, and then split into four treatments with four replicates each and about 3,000-5,000 individuals per replicate. They were left to evolve in these conditions for 25 generations. Samples were collected at generation F0 and F25 to quantify methylation frequencies using reduced representation bisulfite sequencing (RRBS).

The experimental design

The experimental design

Sample ID Treatment Trt abrreviation Generation
AA_F00_1 Control: ambient temp, ambient CO2 AA F00
AA_F25_2 Control: ambient temp, ambient CO2 AA F25
AH_F25_3 ambient temp, high CO2 AH F25
HA_F25_1 high temp, ambient CO2 HA F25
HH_F25_1 high temp, high CO2 HH F25

RRBS

  • Stands for reduced representation bisulfite sequencing
  • Following the adapter ligation, we bisulfite convert all unmethylated C’s to T’s.

  • Before starting we also spiked in a small amount of DNA from E. coli that we know wasn’t methylated. Using this, we can calculate downstream how efficient our bisulfite conversion was.

How RRBS works

How RRBS works

methylKit capabilities

Flowchart for capabilities of methylKit package

Flowchart for capabilities of methylKit package

Pipeline

  1. Visualize, clean, visualize
  • Visualize quality of raw data with fastqc
  • clean raw data with Trimmomatic
  • Visualize quality of cleaned data with fastqc
  1. Align to Acartia tonsa reference genome (Bismark)
  • also align lambda DNA to check for conversion efficiency (Already Done)
  1. Extract methylation calls
  2. Process and filter calls
  3. Summary figures (PCA, clustering, etc)
  4. Test for differential methylation (Methylkit)

Bisulfite sequence data looks weird

Unlike the normal sequence reads, in bisulfite conversion there is a high no. of T bases. This increase is because of the fact that all the non-methylated cytosines are converted to thyamine bases. While the that the reverse strand is the reverse complement of the bisulfite converted forward strand, and is not just the BS converted bottom strand. Thus reverse read being complement to the forward read will have a higher no. of adenine bases.

Why bisulfite fastq visualization looks weird

Why bisulfite fastq visualization looks weird

FastQ report

FastQ for the bisulfite convereted forward read after trimming

FastQ for the bisulfite convereted forward read after trimming

FastQ for the bisulfite convereted reverse read after trimming

FastQ for the bisulfite convereted reverse read after trimming

Bismark

  • Align to Acartia tonsa reference genome using Bismark
  • Align lambda DNA of E. coli ro check for the bisulfite conversion rate (~98%)
    Location of data stored:
    cd /data/project_data/epigenetics/reference_genome/Bisulfite_Genome/

Sample assigned
- AH_F25_4

What is Bismark

  • Bismark is a set of tools for the time-efficient analysis of Bisulfite-Seq (BS-Seq) data.
  • Bismark performs alignments of bisulfite-treated reads to a reference genome and cytosine methylation calls at the same time.
  • Bismark is written in Perl and is run from the command line.
  • Bisulfite-treated reads are mapped using the short read aligner Bowtie 1 (Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25), or alternatively Bowtie 2, and therefore it is a requirement that Bowtie 1 (or Bowtie 2) are also installed on your machine (see Dependencies).

All files associated with Bismark as well as a test BS-Seq data set can be downloaded from: http://www.bioinformatics.babraham.ac.uk/projects/bismark/

General form

USAGE: bismark [options] <genome_folder> {-1 <mates1> -2 <mates2> | <singles>}

ARGUMENTS:

  • <genome_folder>
    • The path to the folder containing the unmodified reference genome as well as the subfolders created by the Bismark_Genome_Preparation script (/Bisulfite_Genome/CT_conversion/ and /Bisulfite_Genome/GA_conversion/).
    • Bismark expects one or more fastA files in this folder(file extension: .fa, .fa.gz or .fasta or .fasta.gz). The path can be relative or absolute.
    • The path may also be set as --genome_folder /path/to/genome/folder/.
  • -1 <mates1>
    • Comma-separated list of files containing the #1 mates (filename usually includes"_1"), e.g. flyA_1.fq,flyB_1.fq). Sequences specified with this option must correspond file-for-file and read-for-read with those specified in .
    • Reads may be a mix of different lengths. Bismark will produce one mapping result and one report file per paired-end input file pair.
  • -2 <mates2>
    • Comma-separated list of files containing the #2 mates (filename usually includes "_2"), e.g. flyA_2.fq,flyB_2.fq). Sequences specified with this option must correspond file-for-file and read-for-read with those specified in .
    • Reads may be a mix of different lengths.
  • <singles>
    • A comma- or space-separated list of files containing the reads to be aligned (e.g. lane1.fq,lane2.fq lane3.fq). Reads may be a mix of different lengths.
  • Bismark will produce one mapping result and one report file per input file.
  • Please note that one should supply a list of files in conjunction with –basename as the output files will constantly overwrite each other…

Bash code

  • type bismark in bash to start the program

  • bismark -help for help with the wrapper

  • --output_dir - one output directory for all the data

  • forward read are - _1
  • reverese reads are - _2
  • adding read group
    • --rg_tag
    • --rg_id - sample id
  • --gzip - compress
  • --local - local alignment / soft clipping of the bases
  • --maxins - max insert size

#!/bin/bash

bismark --bowtie2 --multicore 1 \
    --genome /data/project_data/epigenetics/reference_genome \
    --output_dir /data/project_data/epigenetics/aligned_output \
    -1 /data/project_data/epigenetics/trimmed_fastq/AH_F25_4_1.fq.gz \
    -2 /data/project_data/epigenetics/trimmed_fastq/AH_F25_4_2.fq.gz \
    --rg_tag --rg_id AH_F25_4 --rg_sample AH_F25_4 --gzip --local --maxins 1000

Bismark is, at its core, running bowtie2. But there are important differences. If you remember, we’re using two modified versions of the genome where we’ve converted C-to-T AND G-to-A. We also generate temporary files of our trimmed reads where we convert C-to-T (read1) AND G-to-A (read2) so they can map to this converted genome. This makes the alignment a bit harder because : 1. the complexity of DNA is reduced
2. we are mapping to two separate genomes.

Parameters

--bowtie2 tells bismark to map with bowtie2. There are other options possible here (bowtie1, for example)

--multicore 1 Use one core. We could use multiple cores to make alignment faster, but we may crash the server if many of you are mapping at once if we do this.

--genome the location of our converted genome. The bismark package has a command to prep your genome that I’ve already done. You can check it out: bismark_methylation_extractor --help
-1 the path to your sample’s forward read
-2 the path to your sample’s reverse read

--rg_tag add a read group tag that identifies your individual sample in your output bam file

--rg_id the string that defines the readgroup ID

--rg_sample the string that defines your readgroup sample ID
--gzip for the temporary files, use gzip to save space
--local align with the local alignment option in bowtie2. This will include soft clipping, which should increase mapping rate, but comes at the cost of (maybe) increasing mis-mapping.

--maxins specifies the maximum insert size for mapping. We’re using 1000 as our libraries had a mean insert size of ~200. Note that this is smaller than most sequencing libraries.

#!/bin/bash

bismark --bowtie2 --multicore 1 \
    --genome /data/project_data/epigenetics/reference_genome \
    --output_dir /data/project_data/epigenetics/aligned_output \
    -1 /data/project_data/epigenetics/trimmed_fastq/SAMPLEID_1.fq.gz \
    -2 /data/project_data/epigenetics/trimmed_fastq/SAMPLEID_2.fq.gz \
    --rg_tag --rg_id SAMPLEID --rg_sample SAMPLEID --gzip --local --maxins 1000

Here the SAMPLEID was replaced with my sample id AH_F25_4

Mapping reads

  • zcat to read the .gz files

On Mac, use the following: zcat < AA_F00_1_1_bismark_bt2_pe.bismark.cov.gz

Mapping rate
- The mapping rate is very high in AA_F00, the control at zero generations, which can be worrying. This might be due to the fact that slightly more samples went into sequecing AA_F00.
- So including more DNAs might actually help increase the mapping rate.
- Maybe mapping them as single ends rather than paired end might help increase the mapping rate.
- THe bisulfite conversion rate was checked and found out to be 98-99%, when compared with the E.coli that was spiked in.

Mapping rates for different treatements

Mapping rates for different treatements

Extract methylation calls

bismark_methylation_extractor

USAGE: bismark_methylation_extractor [options] <filenames>

A typical command to extract context-dependent (CpG/CHG/CHH) methylation could look like this:
bismark_methylation_extractor -s –comprehensive
test_dataset.fastq_bismark.sam
This will produce three output files:
(a) CpG_context_test_dataset.fastq_bismark.txt
(b) CHG_context_test_dataset.fastq_bismark.txt
(c) CHH_context_test_dataset.fastq_bismark.txt

Methylation report

bismark_methylation_extractor --bedGraph --scaffolds --gzip \
    --cytosine_report --comprehensive \
    --no_header \
    --parallel 6 \
    --output ~/tonsa_epigenetics/analysis/methylation_extract/ \
    --genome_folder /data/copepods/tonsa_genome/ \
    *pe.bam

Methylation extraction report

M-Bias Plot
- Methylation Bias Plot
- CpG methylation is the dark line across the length of the read, and we are intersted in this line
- CpG methylation is same across the read, which is what we expect to see
- CpG reads are very high at the begining for read 1, we are unsure of what it is so we are going to cut the first two bases to remove this bias.
- It is generally recommended to remove this bias
- The same spike was also seen in the spiked E.coli
- The mechanism that is causing it is very unclear
- Generally considered as a technical artifact rather than a biological artifact
- Read 2 shows a bit of a dip in CpG at the beginning

Methylation report ignoring bias

To ignore the first two bases:
- Done to remove the high CpG read at the begining of the reads

bismark_methylation_extractor --bedGraph --scaffolds --gzip \
    --cytosine_report --comprehensive \
    --no_header \
    --parallel 6 \
    --ignore 2 --ignore_r2 2 \
    # this says to ignore the first two bases
    --output ~/tonsa_epigenetics/analysis/methylation_extract/ \
    --genome_folder /data/copepods/tonsa_genome/ \
    *pe.bam

Methylation extraction report ignoring bias

bash

We can look at sites with some data using the following:

zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | head

Column 1: Chromosome
Column 2: Start position
Column 3: End position
- Both start and stop column of the base we are intersted in are going to be same, since we are looking at single bases
Column 4: Methylation percentage
Column 5: Number of methylated C’s
Column 6: Number of unmethylated C’s

zcat HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz | awk '$4 > 5 && $5 > 10' | head
  • says column 4 values > 5

  • says column 5 > ten methylated reads

methylKit

R code

Packages required:

require(tidyverse)
require(pheatmap)

# fixing issues with methylKit package installation
# BiocManager::install("methylKit")
# install.packages("tibble", dependencies = TRUE, INSTALL_opts = '--no-lock')
require(methylKit)
require(plotly)
require(patchwork)
# first, we want to read in the raw methylation calls with methylkit

# set directory with absolute path (why is this necessary? I have no idea, but gz files wont work with relative paths)
dir <- "C:\\Epigenetics_data\\Epigenetics_data\\"

list.files(dir)  
##  [1] "AA_F00_1_1_bismark_bt2_pe.bismark.cov.gz"
##  [2] "AA_F00_2_1_bismark_bt2_pe.bismark.cov.gz"
##  [3] "AA_F00_3_1_bismark_bt2_pe.bismark.cov.gz"
##  [4] "AA_F00_4_1_bismark_bt2_pe.bismark.cov.gz"
##  [5] "AA_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
##  [6] "AA_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
##  [7] "AA_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
##  [8] "AA_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
##  [9] "AH_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [10] "AH_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [11] "AH_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [12] "AH_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [13] "bed_files"                               
## [14] "checklist.chk"                           
## [15] "HA_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [16] "HA_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [17] "HA_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [18] "HA_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [19] "HH_F25_1_1_bismark_bt2_pe.bismark.cov.gz"
## [20] "HH_F25_2_1_bismark_bt2_pe.bismark.cov.gz"
## [21] "HH_F25_3_1_bismark_bt2_pe.bismark.cov.gz"
## [22] "HH_F25_4_1_bismark_bt2_pe.bismark.cov.gz"
## [23] "methylBase_united.txt.bgz"               
## [24] "methylBase_united.txt.bgz.tbi"           
## [25] "sample_id.txt"
# read in the sample ids
samples <- read.table("C:\\Epigenetics_data\\Epigenetics_data\\sample_id.txt",
                      header = FALSE)
# now point to coverage files
files <- file.path(dir, samples$V1) # pasting the directory and samples together  
all(file.exists(files)) # to check if the path is real and points to the correct location 
## [1] TRUE
# convert to list
file.list <- as.list(files) # for methylKit to run properly  
# get the names only for naming our samples
nmlist <- as.list(gsub("_1_bismark_bt2_pe.bismark.cov.gz","",samples$V1))
head(nmlist)
## [[1]]
## [1] "AA_F00_1"
## 
## [[2]]
## [1] "AA_F00_2"
## 
## [[3]]
## [1] "AA_F00_3"
## 
## [[4]]
## [1] "AA_F00_4"
## 
## [[5]]
## [1] "AA_F25_1"
## 
## [[6]]
## [1] "AA_F25_2"
# use methRead to read in the coverage files
## converting to a methylKit object from the list made  
myobj <- methRead(location = file.list,   # path to the files stored
                  sample.id = nmlist,     # sample ID info 
                  assembly = "atonsa",    # this is just a string. no actual database
                  dbtype = "tabix",       # going to make a database in the form of .txt files
                                          # this is also good for saving up on the memory
                  context = "CpG",      
                  resolution = "base",    # resolution is for SNPs so at a base elevel
                  mincov = 20,            # min 20 reads per site to get accurate estimation
                                          # 10 is an acceptable value too
                  treatment = c(0,0,0,0,  # four AA_F00 files
                                1,1,1,1,  # four AA_F25 files
                                2,2,2,2,  # four AH_F25 files
                                3,3,3,3,  # four HA_F25 files
                                4,4,4,4), # four HH_F25 files
                  pipeline = "bismarkCoverage", # input data format
                  dbdir = "C:\\Epigenetics_data\\") # not necessary to be specified
######
# visualize coverage and filter
######

# We can look at the coverage for individual samples with getCoverageStats()
getCoverageStats(myobj[[1]], plot = TRUE, both.strands = FALSE) #Get CpG coverage information

# numbers on bars denote what percentage of locations are contained in that bin
# Experiments that are highly suffering from PCR duplication bias will have 
# a secondary peak towards the right hand side of the histogram

myobj[[1]] # no. of Cs and Ts should equal coverage
## methylRawDB object with 67815 rows
## --------------
##          chr start   end strand coverage numCs numTs
## 1 LS041577.1 13445 13445      *       24     0    24
## 2 LS041577.1 13454 13454      *       24     1    23
## 3 LS041577.1 13457 13457      *       24     0    24
## 4 LS041577.1 13469 13469      *       24     0    24
## 5 LS041577.1 13472 13472      *       24     0    24
## 6 LS041577.1 13490 13490      *       23     0    23
## --------------
## sample.id: AA_F00_1 
## assembly: atonsa 
## context: CpG 
## resolution: base 
## dbtype: tabix
# looking at percent methylation 
getMethylationStats(myobj[[1]],plot=TRUE,both.strands=FALSE)

# and can plot all of our samples at once to compare.
# par(mfrow=c(5,4))
# 
# for(i in 1:length(myobj)) {
#   getCoverageStats(myobj[[i]], plot = TRUE, both.strands = FALSE) #Get CpG coverage information
# } #Plot and save %CpG methylation information
# 
# par(mfrow=c(1,1))
The histogram of CpG coverage for all the samples

The histogram of CpG coverage for all the samples

# filter samples by depth with filterByCoverage()
filtered.myobj <- filterByCoverage(myobj,
                                   lo.count = 20, lo.perc = NULL,
                                   hi.count = NULL, hi.perc = 97.5,
                                   db.dir = "C:\\Epigenetics_data\\")
######
# merge samples
######

#Note! This takes a while and we're skipping it

# use unite() to merge all the samples. We will require sites to be present in each sample or else will drop it

# don't run as it takes a lot of time  
# meth <- unite(filtered.myobj, 
#                mc,cores=3, 
#                suffix="united", # add united at the end of the name
#                db.dir="C:\\Epigenetics_data\\")

Methylkit has a convenient aspect where we can load previously generated databases. This means we don’t have to re-run analyses, but can skip ahead to where we previously left off. We will do this below to load the united database that I’ve already generated.

meth <- methylKit:::readMethylBaseDB(
                      dbpath = "C:\\Epigenetics_data\\Epigenetics_data\\methylBase_united.txt.bgz",
                            dbtype = "tabix",
                            sample.id =   unlist(nmlist),
                            assembly = "atonsa", # this is just a string. not an actual database
                            context = "CpG",
                            resolution = "base",
                            treatment = c(0,0,0,0,
                              1,1,1,1,
                              2,2,2,2,
                              3,3,3,3,
                              4,4,4,4),
                            destrand = FALSE)

Methylation statistics across samples

# percMethylation() calculates the percent methylation for each site and sample
pm <- percMethylation(meth)
head(pm) # methylation percentage for each site per each sample
##      AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4  AA_F25_1  AA_F25_2  AA_F25_3
## [1,] 0.000000 5.000000 7.692308 2.564103  1.123596 3.0000000 3.9603960
## [2,] 4.166667 2.500000 0.000000 0.000000 10.112360 0.9900990 0.9708738
## [3,] 0.000000 2.564103 3.846154 0.000000  3.370787 0.9803922 0.9615385
## [4,] 0.000000 2.500000 3.846154 0.000000  1.086957 2.9411765 0.0000000
## [5,] 0.000000 2.500000 7.692308 0.000000  1.086957 2.9702970 0.0000000
## [6,] 0.000000 2.500000 3.846154 0.000000  2.150538 0.0000000 1.9417476
##       AA_F25_4  AH_F25_1  AH_F25_2  AH_F25_3  AH_F25_4 HA_F25_1 HA_F25_2
## [1,] 2.8735632 3.6866359 2.2388060 2.5423729 4.1095890 9.375000 2.816901
## [2,] 1.1494253 0.0000000 3.6764706 0.0000000 1.3698630 3.125000 1.408451
## [3,] 1.1494253 0.9090909 0.7407407 0.8474576 2.7397260 0.000000 0.000000
## [4,] 3.9772727 2.2727273 0.7352941 0.0000000 0.6849315 0.000000 0.000000
## [5,] 0.5714286 0.4608295 2.2058824 1.6949153 2.0689655 0.000000 1.408451
## [6,] 3.4090909 1.8348624 0.7299270 0.8403361 1.3698630 4.477612 0.000000
##      HA_F25_3 HA_F25_4 HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4
## [1,] 3.030303 2.439024 4.761905        0 1.470588 0.000000
## [2,] 0.000000 1.219512 1.190476        0 0.000000 2.083333
## [3,] 1.515152 1.219512 3.571429        0 1.470588 4.166667
## [4,] 0.000000 1.219512 1.190476        0 0.000000 0.000000
## [5,] 0.000000 0.000000 1.204819        0 5.882353 4.166667
## [6,] 2.898551 2.439024 0.000000        0 1.470588 2.083333
#plot methylation histograms
ggplot(gather(as.data.frame(pm)), 
              aes(value)) + 
              geom_histogram(bins = 10, color="black", fill="grey") + 
              facet_wrap(~key)

- none of these are fully methylated
- biological reason - pools of individuals, so if any slight change in one individual can affect the overall methylation levels for any individual site
- differenes can also be present because of differenes due to different cell types
- this U pattern is quite common in model organisms, with high methyaltion at zero and 100, and very low in between
- The reason could also be technical

# calculate and plot mean methylation
sp.means <- colMeans(pm)

sp.means
## AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4 AA_F25_1 AA_F25_2 AA_F25_3 AA_F25_4 
## 41.09502 41.61804 41.50262 40.83576 36.30440 38.37196 39.92734 40.73100 
## AH_F25_1 AH_F25_2 AH_F25_3 AH_F25_4 HA_F25_1 HA_F25_2 HA_F25_3 HA_F25_4 
## 38.25740 36.03000 38.19356 39.57069 39.13568 40.87818 38.46266 38.78108 
## HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4 
## 37.59602 38.26307 38.42900 39.20101
dim(pm) # methlation for 15158 sites in 20 samples
## [1] 15158    20
p.df <- data.frame(sample=names(sp.means),
          group = substr(names(sp.means), 1,6),
          methylation = sp.means)

ggplot(p.df, aes(x=group, y=methylation, color=group)) + 
    stat_summary(color="black") + geom_jitter(width=0.1, size=3) +
  ylab("Percent Methylation") + xlab("Treatment") + theme_bw() + 
  theme(axis.text=element_text(size=14),
        axis.title=element_text(size=15,face="bold"))
## No summary function supplied, defaulting to `mean_se()`

## is the high methylation rate high in AA_F00 or is it due to it having higher mapping rate compared to the rest of the samples?? - hard to know  
  • mean and std error are represented in the plot
  • methylation rate is higher in AA_F00 group
    • is this due to higher mapping rate that was observed for AA_F00 before
    • hard to know if a better library can lead to higher methylation rate score or tighter distribution with just this set of samples
  • mean methylation for each sample means the average methylaton rate across the whole genome or the 15000 sites studied here
    • average of every site per sample
    • we don’t want a large varation across the means
  • in response to an environmental conditions the global methylation rate can increase or decrease; is there a genome wide response to being in the lab for 25 generations

Summarize variation: PCA, clustering

PCA

# sample clustering
# looking for similarities and differences between the samples / treatments  
clusterSamples(meth,
               dist = "correlation", 
               method = "ward.D",    
               plot = TRUE)

## 
## Call:
## hclust(d = d, method = HCLUST.METHODS[hclust.method])
## 
## Cluster method   : ward.D 
## Distance         : pearson 
## Number of objects: 20
# they are not clustering by treatments  
# F00 samples are clustering together  but not so much in other treatments 

# PCA
PCASamples(meth)

# similarly messy  

find differentially methylated sites between two groups

# subset with reorganize()

meth_sub <- reorganize(meth,
                sample.ids =c("AA_F00_1","AA_F00_2","AA_F00_3", "AA_F00_4",
                              "HH_F25_1","HH_F25_2","HH_F25_3","HH_F25_4"),
                treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
                save.db=FALSE) 

# calculate differential methylation

myDiff <- calculateDiffMeth(meth_sub,
                            overdispersion = "MN",
                            mc.cores = 1,
                            suffix = "AA_HH",
                            adjust = "qvalue", # FDR correction - commonly applied to genomic analysis  
                            test = "Chisq") # chi square 
## two groups detected:
##  will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# this model of overdispersion and test statistic seems to be the best
  • fit a logistic regression to methylation values where explanatory variable is the treatment (case or control).
    • and we compare the fit of the model with explanatory variable vs the null model (no explanatory variable) and ask if the fit is better using a Chisq test.
  • the methylation proportions are weighted by their coverage, as in a typical logistic regression. Note that in theory you could enter these as two column success and failure data frame, which is common in logistic regressions.

  • use overdispersion: Chisq without overdispersion finds more true positives, but many more false positives. good compromise is overdispersion with Chisq.
    • reduced true pos, but really reduces false pos rate.
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
                        qvalue = 0.05, # taking out only the significant ones
                        difference = 10) # methylation differences that is taken to count as significant 
# 10 is an arbitary number- it can range from 10 to 25%

# we can visualize the changes in methylation frequencies quickly.

head(getData(myDiff)$meth.diff)
## [1] -10.16196 -17.85524 -13.96676 -13.15037  10.08852  10.08054
hist(getData(myDiff)$meth.diff) # methylation differences 

# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a lower methylation % in HH samples compared to AA saples

# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
hist(getData(hyper)$meth.diff)

# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
hist(getData(hypo)$meth.diff)

Plots of differentially methylated groups

#heatmaps first
pm <- percMethylation(meth_sub)

# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]

# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr, 
                      getData(myDiff)$start, sep=":"), 
                din, pm.sig)
colnames(df.out) <- c("snp", 
                      colnames(din), 
                      colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))

####
# heatmap
####

my_heatmap <- pheatmap(pm.sig,
        show_rownames = FALSE) # red is higher methylation

# we can also normalize 
ctrmean <- rowMeans(pm.sig[,1:4])

h.norm <- (pm.sig-ctrmean)

my_heatmap <- pheatmap(h.norm,
        show_rownames = FALSE)

##### if you want to change colors. only because I don't love the default colors.

paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1), 
              seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
              
my_heatmap <- pheatmap(h.norm,
        color=myColor, 
        breaks=myBreaks,
        show_rownames = FALSE)

#####
#let's look at methylation of specific snps
####

# convert data frame to long form
head(df.out,3)
##               snp        chr start end AA_F00_1 AA_F00_2 AA_F00_3 AA_F00_4
## 23 LS041600.1:376 LS041600.1   376 376 84.37500 83.97436 84.28571 83.57143
## 29 LS041600.1:475 LS041600.1   475 475 77.89474 75.00000 89.04110 80.55556
## 58  LS042377.1:90 LS042377.1    90  90 61.85286 59.91903 58.59155 58.15085
##    HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4       pvalue       qvalue meth.diff
## 23 76.66667 77.47748 70.17544 70.00000 6.463564e-04 0.0086976959 -10.16196
## 29 45.45455 66.03774 67.50000 68.62745 3.809827e-03 0.0291370853 -17.85524
## 58 53.57143 46.09929 40.42553 49.15254 4.128226e-06 0.0002458728 -13.96676
df.plot <- df.out[,c(1,5:12)] %>% 
                pivot_longer(-snp, values_to = "methylation")

df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
##   snp            name     methylation group
##   <fct>          <chr>          <dbl> <chr>
## 1 LS041600.1:376 AA_F00_1        84.4 AA   
## 2 LS041600.1:376 AA_F00_2        84.0 AA   
## 3 LS041600.1:376 AA_F00_3        84.3 AA   
## 4 LS041600.1:376 AA_F00_4        83.6 AA   
## 5 LS041600.1:376 HH_F25_1        76.7 HH   
## 6 LS041600.1:376 HH_F25_2        77.5 HH
# looking at snp LS041600.1:376

df.plot %>% filter(snp=="LS041600.1:376") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")

## write bed file for intersection with genome annotation
write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\diffmeth.bed",
          data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
          row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")

Experiment

AA_F25 and AH_F25

find differentially methylated sites between two groups

# subset with reorganize()

meth_sub <- reorganize(meth,
                sample.ids =c("AA_F25_1","AA_F25_2","AA_F25_3", "AA_F25_4",
                              "AH_F25_1","AH_F25_2","AH_F25_3","AH_F25_4"),
                treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
                save.db=FALSE) 

# calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
                            overdispersion = "MN",
                            mc.cores = 1,
                            suffix = "AA_AH",
                            adjust = "qvalue", 
                            test = "Chisq") 
## two groups detected:
##  will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
                        qvalue = 0.05, 
                        difference = 10) 

head(getData(myDiff)$meth.diff)
## [1] -13.38261 -12.50014  16.57960 -10.85679 -11.15066 -11.61221
hist(getData(myDiff)$meth.diff) # methylation differences 

# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a lower methylation % in AH_F25 samples compared to AA_F25 samples

# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
dim(hyper)
## [1] 6 7
# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
dim(hypo)
## [1] 24  7
par(mfrow=c(1,2))
hist(getData(hyper)$meth.diff)
hist(getData(hypo)$meth.diff)

par(mfrow=c(1,2))

Plots of differentially methylated groups

#heatmaps first
pm <- percMethylation(meth_sub)

# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]

# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr, 
                      getData(myDiff)$start, sep=":"), 
                din, pm.sig)
colnames(df.out) <- c("snp", 
                      colnames(din), 
                      colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))

####
# heatmap
####

my_heatmap <- pheatmap(pm.sig,
        show_rownames = FALSE) # red is higher methylation

# we can also normalize 
ctrmean <- rowMeans(pm.sig[,1:4])

h.norm <- (pm.sig-ctrmean)

my_heatmap <- pheatmap(h.norm,
        show_rownames = FALSE)

AH_heat <- my_heatmap

##### Recoloring the heatmaps.
paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1), 
              seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
              
my_heatmap <- pheatmap(h.norm,
        color=myColor, 
        breaks=myBreaks,
        show_rownames = FALSE)

#####
#let's look at methylation of specific snps
####

# convert data frame to long form
head(df.out,3)
##                  snp        chr start  end AA_F25_1 AA_F25_2 AA_F25_3 AA_F25_4
## 1965 LS051659.1:1214 LS051659.1  1214 1214 62.50000 69.79866 68.31683 70.14218
## 2251  LS053241.1:414 LS053241.1   414  414 42.34694 41.46341 31.70732 39.47368
## 3451 LS068114.1:6322 LS068114.1  6322 6322 30.35714 28.57143 15.73034 21.60494
##      AH_F25_1 AH_F25_2 AH_F25_3 AH_F25_4       pvalue     qvalue meth.diff
## 1965 48.33333 49.47368 62.40602 56.06061 5.298836e-05 0.04100001 -13.38261
## 2251 30.18868 24.83660 27.77778 28.57143 1.235214e-04 0.04468703 -12.50014
## 3451 43.44262 33.00000 42.02899 39.13043 1.789182e-05 0.02076583  16.57960
df.plot <- df.out[,c(1,5:12)] %>% 
                pivot_longer(-snp, values_to = "methylation")

df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
##   snp             name     methylation group
##   <fct>           <chr>          <dbl> <chr>
## 1 LS051659.1:1214 AA_F25_1        62.5 AA   
## 2 LS051659.1:1214 AA_F25_2        69.8 AA   
## 3 LS051659.1:1214 AA_F25_3        68.3 AA   
## 4 LS051659.1:1214 AA_F25_4        70.1 AA   
## 5 LS051659.1:1214 AH_F25_1        48.3 AH   
## 6 LS051659.1:1214 AH_F25_2        49.5 AH
# looking at snp LS051659.1:1214

AH_snp <- df.plot %>% filter(snp=="LS051659.1:1214") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")
ggplotly(AH_snp)
## write bed file for intersection with genome annotation
# write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\AA_AH_diffmeth.bed",
#           data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
#           row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
AH_snp1 <- df.plot %>% filter(snp=="LS274951.1:141") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("Percent Methylation") + xlab("") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold"),legend.position = "none") +
            ggtitle("Endonuclease III-like protein 1 ") + ylim(0,95)


AH_snp2 <- df.plot %>% filter(snp=="LS068114.1:6322") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("") + xlab("") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold"),legend.position = "none") +
            ggtitle("Deoxyuridine 5’-triphosphate nucleotidohydrolase") + ylim(0,95)


AH_snp3 <- df.plot %>% filter(snp=="LS328870.1:810") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("") + xlab("") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold")) + 
            ggtitle("Retrovirus-related Pol polyprotein from transposon 17.6") + ylim(0,95)

Link to genes

/data/popgen/bedtools2/bin/bedtools closest -a /users/a/p/ap1/EcologicalGenomics/myresults/bedfiles_copepod/AA_AH_diffmeth.bed\
      -b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
      -D b | \
      awk '!($10=="-")' > AA_AH_hits.bed # save filename is hits.bed
# count up number of hits

cat AA_AH_hits.bed | wc -l #18

# count number of unique named genes
cat AA_AH_hits.bed | cut -f 8 | sort | uniq -c

AA_F25 and HH_F25

find differentially methylated sites between two groups

# subset with reorganize()

meth_sub <- reorganize(meth,
                sample.ids =c("AA_F25_1","AA_F25_2","AA_F25_3", "AA_F25_4",
                              "HH_F25_1","HH_F25_2","HH_F25_3","HH_F25_4"),
                treatment = c(0,0,0,0,1,1,1,1), # subsetting to just 8 samples
                save.db=FALSE) 

# calculate differential methylation
myDiff <- calculateDiffMeth(meth_sub,
                            overdispersion = "MN",
                            mc.cores = 1,
                            suffix = "AA_HH",
                            adjust = "qvalue", 
                            test = "Chisq") 
## two groups detected:
##  will calculate methylation difference as the difference of
## treatment (group: 1) - control (group: 0)
# get all differentially methylated bases
myDiff <- getMethylDiff(myDiff,
                        qvalue = 0.05, 
                        difference = 10) 

head(getData(myDiff)$meth.diff)
## [1]  12.36264 -13.76137  31.97324  32.09734  10.66545  10.84294
hist(getData(myDiff)$meth.diff) # methylation differences 

# hist before zero towards the left are hypomethylated and those towards the right are hypermethylated
# we see a sightly higher methylation % in HH_F25 samples compared to AA_F25 samples

# get hyper methylated bases
hyper=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hyper")
dim(hyper)
## [1] 66  7
# get hypo methylated bases
hypo=getMethylDiff(myDiff,difference=10,qvalue=0.05,type="hypo")
dim(hypo)
## [1] 55  7
par(mfrow=c(1,2))
hist(getData(hyper)$meth.diff)
hist(getData(hypo)$meth.diff)

par(mfrow=c(1,2))

Plots of differentially methylated groups

#heatmaps first
pm <- percMethylation(meth_sub)

# make a dataframe with snp id's, methylation, etc.
sig.in <- as.numeric(row.names(myDiff))
pm.sig <- pm[sig.in,]

# add snp, chr, start, stop
din <- getData(myDiff)[,1:3]
df.out <- cbind(paste(getData(myDiff)$chr, 
                      getData(myDiff)$start, sep=":"), 
                din, pm.sig)
colnames(df.out) <- c("snp", 
                      colnames(din), 
                      colnames(df.out[5:ncol(df.out)]))
df.out <- (cbind(df.out,getData(myDiff)[,5:7]))

####
# heatmap
####

my_heatmap <- pheatmap(pm.sig,
        show_rownames = FALSE) # red is higher methylation

# we can also normalize 
ctrmean <- rowMeans(pm.sig[,1:4])

h.norm <- (pm.sig-ctrmean)

my_heatmap <- pheatmap(h.norm,
        show_rownames = FALSE)

HH_heat <- my_heatmap


##### Recoloring the heatmaps.
paletteLength <- 50
myColor <- colorRampPalette(c("cyan1", "black", "yellow1"))(paletteLength)
myBreaks <- c(seq(min(h.norm), 0, length.out=ceiling(paletteLength/2) + 1), 
              seq(max(h.norm)/paletteLength, max(h.norm), length.out=floor(paletteLength/2)))
              
my_heatmap <- pheatmap(h.norm,
        color=myColor, 
        breaks=myBreaks,
        show_rownames = FALSE)

#####
#let's look at methylation of specific snps
####

# convert data frame to long form
head(df.out,3)
##                 snp        chr start  end  AA_F25_1  AA_F25_2  AA_F25_3
## 124 LS042748.1:7742 LS042748.1  7742 7742  6.122449  8.139535 10.526316
## 262 LS043541.1:2223 LS043541.1  2223 2223 76.800000 77.500000 79.591837
## 393  LS043710.1:571 LS043710.1   571  571 32.258065 17.391304  6.569343
##      AA_F25_4 HH_F25_1 HH_F25_2 HH_F25_3 HH_F25_4       pvalue     qvalue
## 124  9.677419 19.60784 20.00000 29.72973 14.28571 0.0003768805 0.03738142
## 262 75.806452 75.67568 61.68224 66.21622 48.38710 0.0003671386 0.03721519
## 393 10.650888 29.16667 51.51515 59.09091 24.13793 0.0004571181 0.04267285
##     meth.diff
## 124  12.36264
## 262 -13.76137
## 393  31.97324
df.plot <- df.out[,c(1,5:12)] %>% 
                pivot_longer(-snp, values_to = "methylation")

df.plot$group <- substr(df.plot$name,1,2)
head(df.plot)
## # A tibble: 6 x 4
##   snp             name     methylation group
##   <fct>           <chr>          <dbl> <chr>
## 1 LS042748.1:7742 AA_F25_1        6.12 AA   
## 2 LS042748.1:7742 AA_F25_2        8.14 AA   
## 3 LS042748.1:7742 AA_F25_3       10.5  AA   
## 4 LS042748.1:7742 AA_F25_4        9.68 AA   
## 5 LS042748.1:7742 HH_F25_1       19.6  HH   
## 6 LS042748.1:7742 HH_F25_2       20    HH
# looking at snp LS042748.1:7742

HH_snp <- df.plot %>% filter(snp=="LS219431.1:243") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")
ggplotly(HH_snp)
# write bed file for intersection with genome annotation
# write.table(file = "C:\\Epigenetics_data\\Epigenetics_data\\bed_files\\AA_HH_diffmeth.bed",
#           data.frame(chr= df.out$chr, start = df.out$start, end = df.out$end),
#           row.names=FALSE, col.names=FALSE, quote=FALSE, sep="\t")
# summary table
df.plot_filter1 <- df.plot %>% filter(group=="AA") %>% 
                    group_by(snp) %>% 
                    summarize(avgMeth = mean(methylation, na.rm = TRUE))
df.plot_filter1$Group <- "AA"

df.plot_filter2 <- df.plot %>% filter(group=="HH") %>% 
                    group_by(snp) %>% 
                    summarize(avgMeth = mean(methylation, na.rm = TRUE))
df.plot_filter2$Group <- "HH"

df.plot_filtered <- merge(df.plot_filter1,df.plot_filter2, by="snp") 

df.plot_filtered <- df.plot_filtered %>% arrange(desc(avgMeth.x))
head(df.plot_filtered) # the highest average methylated snps
##               snp avgMeth.x Group.x avgMeth.y Group.y
## 1  LS219431.1:243  92.83155      AA  81.34244      HH
## 2 LS239438.1:1057  90.13634      AA  77.00420      HH
## 3  LS311600.1:407  89.29486      AA  76.59789      HH
## 4  LS346027.1:369  88.66368      AA  77.91239      HH
## 5  LS058076.1:184  85.31733      AA  75.20231      HH
## 6  LS313322.1:157  84.10741      AA  67.88553      HH
HH_snp1 <- df.plot %>% filter(snp=="LS049748.1:4514") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("Percent Methylation") + xlab("") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold"),legend.position = "none") +
            ggtitle("STE20-like serine/threonine-protein kinase") + ylim(0,95)


HH_snp2 <- df.plot %>% filter(snp=="LS051669.1:9015") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("") + xlab("Treatment") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold"),legend.position = "none") +
            ggtitle("Probable RNA-directed DNA polymerase from transposon BS") + ylim(0,95)


HH_snp3 <- df.plot %>% filter(snp=="LS070286.1:6277") %>% 
            ggplot(., aes(x=group, y=methylation, color=group, fill=group)) +
              stat_summary(fun.data = "mean_se", size = 2) +
              geom_jitter(width = 0.1, size=3, pch=21, color="black")  +
            ylab("") + xlab("") + theme_bw() + 
            theme(axis.text=element_text(size=14),
                  axis.title=element_text(size=15,face="bold")) + 
            ggtitle("Indolethylamine N-methyltransferase") + ylim(0,95)

Plots

AH_heat

HH_heat

(AH_snp1 + AH_snp2 + AH_snp3) / (HH_snp1 + HH_snp2 + HH_snp3) + plot_annotation(tag_levels = "A")
## Warning: Removed 2 rows containing missing values (geom_point).

Link to genes

/data/popgen/bedtools2/bin/bedtools closest -a /users/a/p/ap1/EcologicalGenomics/myresults/bedfiles_copepod/AA_HH_diffmeth.bed\
      -b /data/project_data/epigenetics/GCA_900241095.1_Aton1.0_genomic.fa_annotation_table.bed \
      -D b | \
      awk '!($10=="-")' > AA_HH_hits.bed # save filename is hits.bed
# count up number of hits

cat AA_HH_hits.bed | wc -l #34

# count number of unique named genes
cat AA_HH_hits.bed | cut -f 8 | sort | uniq -c


cat AA_HH_hits.bed | cut -f 10-12 | head
# getting only the unique hits as a bed file
awk '!seen[$8]++' AA_HH_hits.bed > AA_HH_uniqueHits.bed # save filename is hits.bed

Bedtools results

AA_AH_hits

AA_AH_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_AH_hits.bed", sep="\t", header = FALSE)

knitr::kable(AA_AH_hits[,c(1:2,10:13,16:18)], format="pandoc")
V1 V2 V10 V11 V12 V13 V16 V17 V18
LS068114.1 6322 O30931 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
LS108947.1 32 . . . . . . -1
LS109091.1 3297 . . . . . . -1
LS133954.1 2576 . . . . . . -1
LS138960.1 1140 . . . . . . -1
LS145990.1 2562 . . . . . . -1
LS166448.1 985 . . . . . . -1
LS208810.1 1111 . . . . . . -1
LS226718.1 949 . . . . . . -1
LS234884.1 1619 . . . . . . -1
LS246311.1 346 . . . . . . -1
LS274951.1 141 A7M7B9 A0ZT94 F1NQP6 Endonuclease III-like protein 1 {ECO:0000255|HAMAP-Rule:MF_03183} Bifunctional DNA N-glycosylase/DNA-(apurinic or apyrimidinic site) lyase {ECO:0000255|HAMAP-Rule:MF_03183}; PF00633;PF00730; F:4 iron, 4 sulfur cluster binding; F:class I DNA-(apurinic or apyrimidinic site) endonuclease activity; F:DNA-(apurinic or apyrimidinic site) endonuclease activity; F:double-stranded DNA binding; F:metal ion binding; F:oxidized pyrimidine nucleobase lesion DNA N-glycosylase activity; P:base-excision repair, AP site formation; P:nucleotide-excision repair, DNA incision, 5’-to lesion; 0
LS283868.1 874 . . . . . . -1
LS328870.1 810 P04323 Retrovirus-related Pol polyprotein from transposon 17.6 PF17921;PF17917;PF00078; F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; P:DNA integration; 0
LS336361.1 132 . . . . . . -1
LS339920.1 871 . . . . . . -1
LS364758.1 572 . . . . . . -1
LS377801.1 203 . . . . . . -1

AA_AH_uniqueHits

AA_AH_unique_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_AH_uniqueHits.bed", sep="\t", header = FALSE)

# only unique 
knitr::kable(AA_AH_unique_hits[unique(AA_AH_unique_hits$V10),c(1:2,10:13,16:18)],
             format="pandoc")
V1 V2 V10 V11 V12 V13 V16 V17 V18
3 LS274951.1 141 A7M7B9 A0ZT94 F1NQP6 Endonuclease III-like protein 1 {ECO:0000255|HAMAP-Rule:MF_03183} Bifunctional DNA N-glycosylase/DNA-(apurinic or apyrimidinic site) lyase {ECO:0000255|HAMAP-Rule:MF_03183}; PF00633;PF00730; F:4 iron, 4 sulfur cluster binding; F:class I DNA-(apurinic or apyrimidinic site) endonuclease activity; F:DNA-(apurinic or apyrimidinic site) endonuclease activity; F:double-stranded DNA binding; F:metal ion binding; F:oxidized pyrimidine nucleobase lesion DNA N-glycosylase activity; P:base-excision repair, AP site formation; P:nucleotide-excision repair, DNA incision, 5’-to lesion; 0
1 LS068114.1 6322 O30931 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
2 LS108947.1 32 . . . . . . -1
4 LS328870.1 810 P04323 Retrovirus-related Pol polyprotein from transposon 17.6 PF17921;PF17917;PF00078; F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; P:DNA integration; 0

AA_HH_hits

AA_HH_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_HH_hits.bed", sep="\t", header = FALSE)

knitr::kable(AA_HH_hits[,c(1:2,10:13,16:18)], format="pandoc")
V1 V2 V10 V11 V12 V13 V16 V17 V18
LS042748.1 7742 Q91V01 B1B362 O35131 Q8BNH6 Lysophospholipid acyltransferase 5 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; PF03062; F:1-acylglycerophosphocholine O-acyltransferase activity; P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; 0
LS042748.1 7742 Q91V01 B1B362 O35131 Q8BNH6 Lysophospholipid acyltransferase 5 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; PF03062; F:1-acylglycerophosphocholine O-acyltransferase activity; P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; 0
LS043541.1 2223 Q87040 Pro-Pol polyprotein Pr125Pol; PF17921;PF00075;PF17919;PF00665;PF00078;PF18103;PF03539; F:aspartic-type endopeptidase activity; F:DNA-directed DNA polymerase activity; F:metal ion binding; F:RNA binding; F:RNA-directed DNA polymerase activity; F:RNA-DNA hybrid ribonuclease activity; P:DNA integration; P:DNA recombination; P:establishment of integrated proviral latency; P:viral entry into host cell; P:viral genome integration into host DNA; P:viral penetration into host nucleus; 0
LS043710.1 571 Q9HDB9 Endogenous retrovirus group K member 5 Gag polyprotein HERV-K(II) Gag protein;HERV-K_3q12.3 provirus ancestral Gag polyprotein; PF02337;PF00607;PF00098; F:nucleic acid binding; F:structural molecule activity; F:zinc ion binding; P:viral process; 0
LS046185.1 8259 O73798 Insulin-like growth factor 1 receptor PF00757;PF07714;PF01030; F:ATP binding; F:insulin receptor substrate binding; F:insulin-like growth factor binding; F:insulin-like growth factor-activated receptor activity; F:metal ion binding; F:phosphatidylinositol 3-kinase binding; F:protein tyrosine kinase activity; P:insulin-like growth factor receptor signaling pathway; P:multicellular organism development; P:oocyte maturation; P:protein autophosphorylation; P:protein tetramerization; 0
LS049748.1 4514 O55092 STE20-like serine/threonine-protein kinase STE20-related serine/threonine-protein kinase; PF00069;PF12474; F:ATP binding; F:protein homodimerization activity; F:protein serine/threonine kinase activity; P:apoptotic process; P:protein autophosphorylation; P:regulation of cell migration; P:regulation of focal adhesion assembly; 0
LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
LS051734.1 5459 O30931 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
LS055310.1 71 . . . . . . -1
LS056368.1 228 . . . . . . -1
LS058076.1 184 . . . . . . -1
LS062329.1 6106 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1368
LS062329.1 6106 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1368
LS062329.1 6147 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1327
LS062329.1 6147 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1327
LS064520.1 7213 . . . . . . -1
LS064920.1 4017 . . . . . . -1
LS064920.1 4018 . . . . . . -1
LS068114.1 6322 O30931 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
LS070286.1 6277 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1631
LS070286.1 6277 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1631
LS070286.1 6280 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1634
LS070286.1 6280 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1634
LS075862.1 65 . . . . . . -1
LS082036.1 4666 . . . . . . -1
LS082036.1 4896 . . . . . . -1
LS091612.1 430 . . . . . . -1
LS106615.1 2104 . . . . . . -1
LS108711.1 492 . . . . . . -1
LS108711.1 666 . . . . . . -1
LS109311.1 448 . . . . . . -1
LS109311.1 449 . . . . . . -1
LS109311.1 456 . . . . . . -1
LS109311.1 457 . . . . . . -1
LS109311.1 466 . . . . . . -1
LS109311.1 467 . . . . . . -1
LS138960.1 1421 . . . . . . -1
LS138960.1 1936 . . . . . . -1
LS138960.1 2683 . . . . . . -1
LS143812.1 50 . . . . . . -1
LS148351.1 2003 P04323 Retrovirus-related Pol polyprotein from transposon 17.6 PF17921;PF17917;PF00078; F:aspartic-type endopeptidase activity; F:endonuclease activity; F:nucleic acid binding; F:RNA-directed DNA polymerase activity; P:DNA integration; -2
LS156187.1 2187 . . . . . . -1
LS156187.1 2192 . . . . . . -1
LS156187.1 2194 . . . . . . -1
LS156187.1 2203 . . . . . . -1
LS156187.1 2204 . . . . . . -1
LS156187.1 2371 . . . . . . -1
LS166448.1 985 . . . . . . -1
LS166742.1 369 . . . . . . -1
LS175657.1 327 . . . . . . -1
LS175657.1 336 . . . . . . -1
LS186826.1 68 . . . . . . -1
LS190825.1 1484 . . . . . . -1
LS194438.1 77 . . . . . . -1
LS196635.1 188 . . . . . . -1
LS197389.1 867 . . . . . . -1
LS203364.1 253 Q8I7P9 Retrovirus-related Pol polyprotein from transposon opus PF17921;PF17917;PF00665;PF00078; F:endonuclease activity; F:nucleic acid binding; F:peptidase activity; F:RNA-directed DNA polymerase activity; P:DNA integration; P:transposition, DNA-mediated; 353
LS216243.1 1374 . . . . . . -1
LS219679.1 194 . . . . . . -1
LS226718.1 959 . . . . . . -1
LS226718.1 1014 . . . . . . -1
LS226733.1 658 B9L823 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
LS226733.1 658 Q9J5G5 Deoxyuridine 5’-triphosphate nucleotidohydrolase dUTP pyrophosphatase; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
LS231127.1 1543 . . . . . . -1
LS231976.1 467 . . . . . . -1
LS231976.1 474 . . . . . . -1
LS232179.1 598 . . . . . . -1
LS239438.1 1057 . . . . . . -1
LS240607.1 71 . . . . . . -1
LS241916.1 1021 A4FUB7 Gypsy retrotransposon integrase-like protein 1 PF17921; F:nucleic acid binding; P:DNA integration; 0
LS243346.1 778 A4FUB7 Gypsy retrotransposon integrase-like protein 1 PF17921; F:nucleic acid binding; P:DNA integration; 5
LS243346.1 781 A4FUB7 Gypsy retrotransposon integrase-like protein 1 PF17921; F:nucleic acid binding; P:DNA integration; 2
LS244160.1 273 . . . . . . -1
LS255237.1 1275 . . . . . . -1
LS279743.1 1105 . . . . . . -1
LS279852.1 203 . . . . . . -1
LS283678.1 900 . . . . . . -1
LS290113.1 1093 . . . . . . -1
LS291054.1 731 . . . . . . -1
LS294040.1 224 . . . . . . -1
LS311600.1 407 . . . . . . -1
LS315617.1 700 . . . . . . -1
LS316650.1 231 . . . . . . -1
LS320205.1 231 . . . . . . -1
LS320226.1 957 . . . . . . -1
LS335360.1 374 . . . . . . -1
LS338923.1 718 . . . . . . -1
LS338923.1 719 . . . . . . -1
LS338923.1 735 . . . . . . -1
LS348101.1 994 . . . . . . -1
LS351877.1 271 . . . . . . -1
LS359373.1 273 . . . . . . -1

AA_HH_uniqueHits

AA_HH_unique_hits <- read.csv("C:\\Epigenetics_data\\Epigenetics_data\\bed_files/AA_HH_uniqueHits.bed",
                              sep="\t", header = FALSE)

# only unique 
knitr::kable(AA_HH_unique_hits[unique(AA_HH_unique_hits$V10),c(1:2,10:13,16:18)],
             format="pandoc")
V1 V2 V10 V11 V12 V13 V16 V17 V18
12 LS062329.1 6106 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1368
10 LS051734.1 5459 O30931 Deoxyuridine 5’-triphosphate nucleotidohydrolase {ECO:0000255|HAMAP-Rule:MF_00116} dUTP pyrophosphatase {ECO:0000255|HAMAP-Rule:MF_00116}; PF00692; F:dUTP diphosphatase activity; F:magnesium ion binding; P:dUMP biosynthetic process; P:dUTP catabolic process; 0
14 LS070286.1 6277 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1631
6 LS049748.1 4514 O55092 STE20-like serine/threonine-protein kinase STE20-related serine/threonine-protein kinase; PF00069;PF12474; F:ATP binding; F:protein homodimerization activity; F:protein serine/threonine kinase activity; P:apoptotic process; P:protein autophosphorylation; P:regulation of cell migration; P:regulation of focal adhesion assembly; 0
5 LS046185.1 8259 O73798 Insulin-like growth factor 1 receptor PF00757;PF07714;PF01030; F:ATP binding; F:insulin receptor substrate binding; F:insulin-like growth factor binding; F:insulin-like growth factor-activated receptor activity; F:metal ion binding; F:phosphatidylinositol 3-kinase binding; F:protein tyrosine kinase activity; P:insulin-like growth factor receptor signaling pathway; P:multicellular organism development; P:oocyte maturation; P:protein autophosphorylation; P:protein tetramerization; 0
13 LS062329.1 6106 Q0P5D3 G1/S-specific cyclin-D2 PF02984;PF00134; F:cyclin-dependent protein serine/threonine kinase regulator activity; F:protein kinase binding; P:adult locomotory behavior; P:cell division; P:cellular response to X-ray; P:long-term memory; P:mitotic cell cycle phase transition; P:positive regulation of G1/S transition of mitotic cell cycle; P:protein phosphorylation; P:regulation of cyclin-dependent protein serine/threonine kinase activity; 1368
4 LS043710.1 571 Q9HDB9 Endogenous retrovirus group K member 5 Gag polyprotein HERV-K(II) Gag protein;HERV-K_3q12.3 provirus ancestral Gag polyprotein; PF02337;PF00607;PF00098; F:nucleic acid binding; F:structural molecule activity; F:zinc ion binding; P:viral process; 0
1 LS042748.1 7742 Q91V01 B1B362 O35131 Q8BNH6 Lysophospholipid acyltransferase 5 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; PF03062; F:1-acylglycerophosphocholine O-acyltransferase activity; P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; 0
9 LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
7 LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
8 LS051669.1 9015 Q95SX7 Probable RNA-directed DNA polymerase from transposon BS Reverse transcriptase; PF14529;PF00078; F:RNA-directed DNA polymerase activity; P:transposition, DNA-mediated; 0
11 LS055310.1 71 . . . . . . -1
3 LS043541.1 2223 Q87040 Pro-Pol polyprotein Pr125Pol; PF17921;PF00075;PF17919;PF00665;PF00078;PF18103;PF03539; F:aspartic-type endopeptidase activity; F:DNA-directed DNA polymerase activity; F:metal ion binding; F:RNA binding; F:RNA-directed DNA polymerase activity; F:RNA-DNA hybrid ribonuclease activity; P:DNA integration; P:DNA recombination; P:establishment of integrated proviral latency; P:viral entry into host cell; P:viral genome integration into host DNA; P:viral penetration into host nucleus; 0
15 LS070286.1 6277 O97972 Indolethylamine N-methyltransferase Aromatic alkylamine N-methyltransferase;Thioether S-methyltransferase; PF01234; F:amine N-methyltransferase activity; F:S-adenosyl-L-methionine:beta-alanine N-methyltransferase activity; F:thioether S-methyltransferase activity; P:amine metabolic process; P:methylation; P:response to toxic substance; -1631
2 LS042748.1 7742 Q91V01 B1B362 O35131 Q8BNH6 Lysophospholipid acyltransferase 5 1-acylglycerophosphocholine O-acyltransferase;1-acylglycerophosphoethanolamine O-acyltransferase;1-acylglycerophosphoserine O-acyltransferase;Lysophosphatidylcholine acyltransferase;Lysophosphatidylcholine acyltransferase 3;Lysophosphatidylethanolamine acyltransferase;Lysophosphatidylserine acyltransferase;Membrane-bound O-acyltransferase domain-containing protein 5; PF03062; F:1-acylglycerophosphocholine O-acyltransferase activity; P:phospholipid biosynthetic process; P:regulation of plasma lipoprotein particle levels; 0

Extra notes

SCP commandline

scp path/on/the/system ap1@pbio381.edu:~/
vim AA_HH_hits.bed 

:set nowrap # to see all the columns
awk '!seen[$8]++' hits.bed | cut -f11- > uniqueHitsNames.bed

that code will remove duplicate lines and only print columns with the name of the gene and the annotation info

 try adding sep="\t"

    also. read.table has issues sometimes with spaces. read.csv can be better